INN Hotels Project¶

Marks : 60¶

Context¶

A significant number of hotel bookings are called off due to cancellations or no-shows. The typical reasons for cancellations include change of plans, scheduling conflicts, etc. This is often made easier by the option to do so free of charge or preferably at a low cost which is beneficial to hotel guests but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impact a hotel on various fronts:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  3. Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
  4. Human resources to make arrangements for the guests.

Objective¶

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Description¶

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

  • Booking_ID: unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of Children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status: Flag indicating if the booking was canceled or not.

Import necessary Libraries and set style¶

In [1]:
# warnings issues
import warnings
warnings.filterwarnings("ignore")


from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import metrics

# To tune different models
from sklearn.model_selection import GridSearchCV



# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
In [2]:
# Set standard styling for visualizations

# writer has slight vision deficiency, these features assist
custom_palette = sns.color_palette('colorblind')
sns.set(rc={'grid.color': 'gray', 'grid.alpha': 0.5})
sns.palplot(custom_palette)

# Paper context selected for readability in turned in .html format,
# but was originally written in noteebook context
sns.set(style='whitegrid', context='paper', palette= custom_palette)
In [3]:
# Set standard styling for charts and numeric displays

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
pd.set_option('display.width', 1000)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

Custom Functions Used in EDA portion of notebook¶

These have essentially entered my personal library from previous projects - much like I assume a company would have branding guidelines on their data visualizations. There is nothing really new here.

The new custom functions around model building are in that section.

In [4]:
def print_outliers_info(data, feature):
   """
   Calculates upper outlier analysis information to pair with visualizations

   data: dataframe
   feature: dataframe column
   """
   Q1 = data[feature].quantile(0.25)
   Q3 = data[feature].quantile(0.75)
   IQR = Q3 - Q1
   upper_bound = Q3 + 1.5 * IQR
   max_feat = data[feature].max()

   outliers = data[(data[feature] > upper_bound)][feature].unique()
   outliers_sorted = np.sort(outliers)

   if len(outliers_sorted) > 6:
       outliers_sorted_abbr = np.append(outliers_sorted[:6], "...etc")
   else:
       outliers_sorted_abbr = outliers_sorted

   if len(outliers) > 0:
       outlier_df = pd.DataFrame({
           "IQR": [IQR],
           "Q3": [Q3],
           "Upper Bound": [upper_bound],
           "Max" : [max_feat],
           "#rows > Upper Bound": [len(data[data[feature] > upper_bound])]
       })
       print(f"{feature} Outliers Information:\n")
       formatted_df = outlier_df.to_string(index=False, col_space=15, justify='left') + '\n'
       print(formatted_df)
       print(f"Unique Values Above Upper Bound: {outliers_sorted_abbr}")
In [5]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """

    # creating the 2 subplots
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75), "hspace": 0.05, "top": 0.95},
        figsize=figsize,
    )

    # create a title
    f2.suptitle(f"Histogram and Boxplot for {feature}", fontsize=16)

    # boxplot will be created and a square will indicate the mean value of the column
    create_boxplot(data, feature, ax_box2)

    # create histogram, with consideration of bins
    create_histogram(data, feature, ax_hist2, kde, bins)

    # Calculate and print outliers information for the specific feature
    print_outliers_info(data, feature)

def create_boxplot(data, feature, ax_box):
    sns.boxplot(
        data=data,
        x=feature,
        ax=ax_box,
        showmeans=True,
        meanprops={"marker": "s", "markersize": 8, "markerfacecolor": custom_palette[1], "markeredgecolor": "black"},
        medianprops={'linewidth': 4},
        color=custom_palette[2]
    )
    ax_box.set_xlabel("")

def create_histogram(data, feature, ax_hist, kde, bins):
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist, bins=bins, alpha=0.7
    ) if bins is not None else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist, alpha=0.7
    )
    add_mean_median_to_histogram(data, feature, ax_hist)

def add_mean_median_to_histogram(data, feature, ax_hist):
    ax_hist.axvline(
        data[feature].mean(), color='black', linestyle='-', linewidth=8
    )
    ax_hist.axvline(
        data[feature].mean(), color=custom_palette[1], linestyle='-', linewidth=5, label="Mean"
    )
    ax_hist.axvline(
        data[feature].median(), color='black', linestyle='-', linewidth=5, label="Median"
    )
    ax_hist.legend(loc='upper right')
In [6]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None, rotation=90, sort_index=False):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    sort_index: whether to sort the index (default is False)
    """
    # Check data type of the column
    if data[feature].dtype == 'O':
       print(f"Skipping outlier analysis for {feature} as it contains string values.")
    elif np.issubdtype(data[feature].dtype, np.number):  # Check if dtype is numeric
       # Perform some action for numeric dtype
       print(f"Performing numeric-specific action for {feature}")
       # Calculate and print outliers information for the specific feature
       print_outliers_info(data, feature)
    else:
       print(f"Unsupported dtype for {feature}.")

    # Plot the barplot
    plot_barplot(data, feature, perc, n, rotation, sort_index)

def plot_barplot(data, feature, perc, n, rotation, sort_index):
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=rotation, fontsize=15)

    order = data[feature].value_counts().index[:n]
    if sort_index:
        order = sorted(order)

    ax = create_countplot(data, feature, order)

    for p in ax.patches:
        add_percentage_label(p, total, perc)

    # Add title
    plt.title(f"Barplot for {feature}")

    plt.show()  # show the plot

def create_countplot(data, feature, order):
    return sns.countplot(
        data=data,
        x=feature,
        palette=custom_palette,
        order=order,
    )

def add_percentage_label(p, total, perc):
    if perc:
        label = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
    else:
        label = p.get_height()  # count of each level of the category

    x = p.get_x() + p.get_width() / 2  # width of the plot
    y = p.get_height()  # height of the plot

    plt.annotate(
        label,
        (x, y),
        ha="center",
        va="center",
        size=12,
        xytext=(0, 5),
        textcoords="offset points",
    )  # annotate the percentage
In [7]:
# This will be used as:
# dups_by_target(data, 'booking_status', 'Not_Canceled', 'Canceled')
# for this dataset
def dups_by_target(data, target, pos_value, neg_value):
    """
    Prints duplicate data for a given target column with positive and negative values.

    Parameters:
    - data: DataFrame
    - target: str, the column to analyze for duplicates
    - pos_value: str, the positive value for the target column
    - neg_value: str, the negative value for the target column
    """

    # Find duplicates for positive value
    pos_dupes = data[data[target] == pos_value].duplicated().sum()
    total_pos = len(data[data[target] == pos_value])
    pos_dupes_perc = (pos_dupes / total_pos) * 100
    pos_duplicates = data[data[target] == pos_value][data[data[target] == pos_value].duplicated(keep=False)]
    unique_sets_pos = pos_duplicates.groupby(list(pos_duplicates.columns)).size().reset_index(name='counts')

    # Find duplicates for negative value
    neg_dupes = data[data[target] == neg_value].duplicated().sum()
    total_neg = len(data[data[target] == neg_value])
    neg_dupes_perc = (neg_dupes / total_neg) * 100
    neg_duplicates = data[data[target] == neg_value][data[data[target] == neg_value].duplicated(keep=False)]
    unique_sets_neg = neg_duplicates.groupby(list(neg_duplicates.columns)).size().reset_index(name='counts')

    # Create DataFrames for positive and negative data
    pos_df = pd.DataFrame({
        "Status": [pos_value],
        "Duplicate Count": [pos_dupes],
        "Percentage": [f"{pos_dupes_perc:.2f}%"],
        "Unique Sets Count": [len(unique_sets_pos)],
        "Max Count in Sets": [unique_sets_pos["counts"].max()]
    })

    neg_df = pd.DataFrame({
        "Status": [neg_value],
        "Duplicate Count": [neg_dupes],
        "Percentage": [f"{neg_dupes_perc:.2f}%"],
        "Unique Sets Count": [len(unique_sets_neg)],
        "Max Count in Sets": [unique_sets_neg["counts"].max()]
    })

    # Concatenate the DataFrames
    summary_df = pd.concat([pos_df, neg_df], ignore_index=True)

    # Print the summary DataFrame
    print(summary_df)
In [8]:
def stacked_barplot(data, predictor, target, rotation=90, sort_columns=True):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    rotation: rotation angle for x-axis labels (default is 90)
    sort_columns: whether to sort columns by values in the predictor/x-axis (default is True)
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)

    if sort_columns:
        tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
            by=sorter, ascending=False
        )
    else:
        tab = pd.crosstab(data[predictor], data[target], normalize="index")

    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.xticks(rotation=rotation, fontsize=15)
    plt.show()
In [9]:
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color=custom_palette[2],
        alpha=0.7,
        stat="density",
        line_kws={"color": "black"}
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color=custom_palette[3],
        alpha=0.7,
        stat="density",
        line_kws={"color": "black"}
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 0]
    )

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )

    plt.tight_layout()
    plt.show()

Import Dataset & Basic Overview¶

In [10]:
# read the data
from google.colab import files
import io

try:
    uploaded
except NameError:
    uploaded = files.upload()

hotel= pd.read_csv(io.BytesIO(uploaded['INNHotelsGroup.csv']))
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving INNHotelsGroup.csv to INNHotelsGroup.csv
In [13]:
# copying data to another variable to avoid any changes to original data
data = hotel.copy()

View the first and last 5 rows of the dataset¶

In [14]:
data.head() ##  view top 5 rows of the data
Out[14]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 INN00001 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00000 0 Not_Canceled
1 INN00002 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68000 1 Not_Canceled
2 INN00003 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00000 0 Canceled
3 INN00004 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00000 0 Canceled
4 INN00005 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50000 0 Canceled
In [15]:
data.tail() ##  view last 5 rows of the data
Out[15]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
36270 INN36271 3 0 2 6 Meal Plan 1 0 Room_Type 4 85 2018 8 3 Online 0 0 0 167.80000 1 Not_Canceled
36271 INN36272 2 0 1 3 Meal Plan 1 0 Room_Type 1 228 2018 10 17 Online 0 0 0 90.95000 2 Canceled
36272 INN36273 2 0 2 6 Meal Plan 1 0 Room_Type 1 148 2018 7 1 Online 0 0 0 98.39000 2 Not_Canceled
36273 INN36274 2 0 0 3 Not Selected 0 Room_Type 1 63 2018 4 21 Online 0 0 0 94.50000 0 Canceled
36274 INN36275 2 0 1 2 Meal Plan 1 0 Room_Type 1 207 2018 12 30 Offline 0 0 0 161.67000 0 Not_Canceled

Understand the shape and types of the dataset¶

In [16]:
data.shape ##  view dimensions of the data
Out[16]:
(36275, 19)

Check the data types of the columns for the dataset¶

In [17]:
data.info() ##  view datatypes for each column, preliminary missing value check
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date                          36275 non-null  int64  
 12  market_segment_type                   36275 non-null  object 
 13  repeated_guest                        36275 non-null  int64  
 14  no_of_previous_cancellations          36275 non-null  int64  
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64  
 18  booking_status                        36275 non-null  object 
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB

Duplicate Check¶

In [18]:
# checking for duplicate values
data.duplicated().sum()
Out[18]:
0

Let's drop the Booking_ID column first before we proceed forward.

In [19]:
data = data.drop(['Booking_ID'], axis = 1) ## Drop the Booking_ID column from the dataframe
In [20]:
dups_by_target(data, 'booking_status', 'Not_Canceled', 'Canceled')
         Status  Duplicate Count Percentage  Unique Sets Count  Max Count in Sets
0  Not_Canceled             5832     23.91%               2093                 91
1      Canceled             4443     37.38%               1045                 83

This is not super surprising, as hotels will have the same room type with very common booking arrangements. It is good to know going in, that of the Canceled data, ~37% of it is in duplicated data, with at least one unique set being quite large at 83 duplicated rows. We can check this again after outlier work to see if the number of duplicated values increases substantially.

In [21]:
data.head()
Out[21]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00000 0 Not_Canceled
1 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68000 1 Not_Canceled
2 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00000 0 Canceled
3 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00000 0 Canceled
4 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50000 0 Canceled

Exploratory Data Analysis + Data Preprocessing¶

Summary of ALL EDA Observations and Feature Preprocessing¶

The EDA sections below were kept in this location instead of in an appendix due to the amount of outlier treatment conducted as part of the EDA. This section is meant to be a complete summary such that opening the rest of the subsections of the EDA are only as needed.

  1. Booking_ID:

    • Removed after confirming no duplicate IDs.
  2. Booking Status:

    • About 2/3 of bookings are not canceled.
    • Encoded Canceled bookings as 1 and Not_Canceled as 0.
    • Cancellations correlate most with increased lead time, but also with less special requests, costs, and certain markets see more cancellations.
    • No Outlier treatment
  3. Number of Adults:

    • 72% of bookings are for 2 adults.
    • Some rooms have 0 adults, potentially students. Valid data point if allowed by the hotel.
    • No Outlier treatment
  4. Number of Children:

    • 93% of bookings had 0 children.
    • Encoded # of children greater than 3 to be 3 children
  5. Number of Weekend Nights:

    • 79% of bookings had between 1-3 weekend nights.
    • Encoded stays longer than 6 weekdays into 6 weekday to represent 2 calendar week stay
  6. Number of Week Nights:

    • 46.5% of bookings had 0 weekend nights.
    • Encoded days 6 and 7 in feature as 5 days for a 3+ calendar weekend stay.
  7. Type of Meal Plan:

    • 76.7% of bookings had Meal Plan 1.
    • Only 5 bookings had Meal Plan 3.
    • No Outlier Treatment
  8. Required Car Parking Space:

    • 96.9% of bookings did not require parking.
    • No Outlier Treatment
  9. Room Type Reserved:

    • 77.5% of bookings had Room_Type_1.
    • No Outlier Treatment, maintained categorical variable
  10. Lead Time:

    • Most within two months.
    • Encoded all outliers > 365 to 365 days - to represent bookings with one year or more lead time
    • Highest correlation value with booking status of all features. Larger lead times correlate with more cancellations.
  11. Arrival Year and Month:

    • Arrival month distribution varies.
    • October has the highest count at 14.7%.
    • No Outlier Treatment or Feature Engineering
  12. Arrival Date:

    • Date of arrival.
    • Left for now, but fully expect to remove later in Model Performance Steps
  13. Market Segment Type:

    • 64% online bookings, 29% offline bookings.
    • Corporate is only 5.6% of market despite being the majority of previous cancellations.
    • Online bookings have the highest rate of cancelations, and complimentary rooms are never canceled.
    • No Outlier Treatment
  14. Repeated Guest:

    • 97.4% of bookings are new customers.
    • 2.6% repeated guests, including 118 guests who booked before but only ever canceled.
    • Those 2.6% guests that are repeating very rarely cancel. Those 118 previous cancels are effectively re-bookings. Which is not a label we have in this data set - it might be nice!
  15. Previous Cancellations and Bookings Not Canceled:

    • Majority had 0 prior cancellations.
    • Most previous cancellations were from Corporate; most data set cancellations were from online.
    • Stong Correlations between these two features and the Repeated Guests.
    • Slight Feature Engineering within section - turned these into categorical features in a seperate dataframe. However, the Repeated Guests feature seems a good summary due to correlations.
  16. Average Price per Room:

    • Average around 100 Euros.
    • Encoded prices > 261.69 Euros - to limit scale, but not count, of outliers. This value was the upper bound of the distribution of the original outliers - so the outliers of the outliers were encoded.
    • There is a slight correlation such that increase in average price of room leads to increased cancellations.
  1. Number of Special Requests:
    • 54.5% of bookings had zero requests.
    • Upper bound was 2.5; all values greater were encoded to 3 requests.
    • Only notable negative correlation in data set, -0.25 with booking status

Let's check the statistical summary of the data.

In [22]:
data.describe().T ##  print the statistical summary of the data
Out[22]:
count mean std min 25% 50% 75% max
no_of_adults 36275.00000 1.84496 0.51871 0.00000 2.00000 2.00000 2.00000 4.00000
no_of_children 36275.00000 0.10528 0.40265 0.00000 0.00000 0.00000 0.00000 10.00000
no_of_weekend_nights 36275.00000 0.81072 0.87064 0.00000 0.00000 1.00000 2.00000 7.00000
no_of_week_nights 36275.00000 2.20430 1.41090 0.00000 1.00000 2.00000 3.00000 17.00000
required_car_parking_space 36275.00000 0.03099 0.17328 0.00000 0.00000 0.00000 0.00000 1.00000
lead_time 36275.00000 85.23256 85.93082 0.00000 17.00000 57.00000 126.00000 443.00000
arrival_year 36275.00000 2017.82043 0.38384 2017.00000 2018.00000 2018.00000 2018.00000 2018.00000
arrival_month 36275.00000 7.42365 3.06989 1.00000 5.00000 8.00000 10.00000 12.00000
arrival_date 36275.00000 15.59700 8.74045 1.00000 8.00000 16.00000 23.00000 31.00000
repeated_guest 36275.00000 0.02564 0.15805 0.00000 0.00000 0.00000 0.00000 1.00000
no_of_previous_cancellations 36275.00000 0.02335 0.36833 0.00000 0.00000 0.00000 0.00000 13.00000
no_of_previous_bookings_not_canceled 36275.00000 0.15341 1.75417 0.00000 0.00000 0.00000 0.00000 58.00000
avg_price_per_room 36275.00000 103.42354 35.08942 0.00000 80.30000 99.45000 120.00000 540.00000
no_of_special_requests 36275.00000 0.61966 0.78624 0.00000 0.00000 0.00000 1.00000 5.00000

Individual Feature Analysis¶

Observations on booking status¶

Lets start with the Target Variable. We will encode Canceled bookings to 1 and Not_Canceled as 0 when we prep for modeling, but lets keep it categorical for analysis.

In [23]:
labeled_barplot(data, "booking_status", perc= True, rotation = 0)
Skipping outlier analysis for booking_status as it contains string values.

As we look at outliers below, if they can keep this approximate 2/3 Not_Canceled and 1/3 Canceled Status that is great.

Observations on lead time¶

In [24]:
histogram_boxplot(data, 'lead_time')
lead_time Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
109.00000       126.00000       289.50000       443             1331                

Unique Values Above Upper Bound: ['290' '291' '292' '293' '294' '295' '...etc']

I am unconcerned about the 0 values; that is people showing up to the hotels because the vacancy sign is on or otherwise change in travel plans. The upper outliers represent 3.6% of the data so lets pull some of those in. It looks like the density of outliers drops off around 350, so lets round up to a full year of 365 days. This creates a data bucket of "greater than one year lead time".

In [25]:
data.loc[data["lead_time"] >= 365, "lead_time"] = 365

Observations on average price per room¶

In [26]:
histogram_boxplot(data, 'avg_price_per_room', bins = 25)
avg_price_per_room Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
39.70000        120.00000       179.55000       540.00000       1069                

Unique Values Above Upper Bound: ['179.71' '179.92' '180.0' '180.16' '180.2' '180.25' '...etc']

The outliers on both ends are worth investigation. The 0 Euro values for some rooms might be exlplained by toom type or market segment, while we may need to decide on a specific value for the upper bound for outliers. The one room over 500 can go, but what about everything else?

In [27]:
print_outliers_info(data, "avg_price_per_room")
avg_price_per_room Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
39.70000        120.00000       179.55000       540.00000       1069                

Unique Values Above Upper Bound: ['179.71' '179.92' '180.0' '180.16' '180.2' '180.25' '...etc']

Lets look at the distribution of data above the upper bound of 179, this will help us determine a "cut off point" that has the correct impact on analysis.

In [28]:
price_upper_bound = 179.55
price_over_ubound = data[data['avg_price_per_room'] >= price_upper_bound]
In [29]:
histogram_boxplot(price_over_ubound, 'avg_price_per_room')
avg_price_per_room Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
29.86000        216.90000       261.69000       540.00000       42                  

Unique Values Above Upper Bound: ['262.7' '263.55' '263.91' '264.1' '265.0' '265.44' '...etc']

I see that the low code notebook suggests only cutting out the one data point greater than 500 Euros. Howevere, since the 1069 outlier room prices is 3% of the data set, and since I lack a subject matter expert, I am instead choosing to replace more outliers than that. Looking at the distribution of outliers, I will instead choose to replace all values above 261.69 (which is the upper bound of the outlier distribution).

In [30]:
# assigning the outliers with the value of 261.69
price_upper_bound2 = 261.69
data.loc[data["avg_price_per_room"] >= price_upper_bound2, "avg_price_per_room"] = price_upper_bound2

Now lets take a look at the 0 - values to assess validty/sanity.

In [31]:
data[data["avg_price_per_room"] == 0]
Out[31]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
63 1 0 0 1 Meal Plan 1 0 Room_Type 1 2 2017 9 10 Complementary 0 0 0 0.00000 1 Not_Canceled
145 1 0 0 2 Meal Plan 1 0 Room_Type 1 13 2018 6 1 Complementary 1 3 5 0.00000 1 Not_Canceled
209 1 0 0 0 Meal Plan 1 0 Room_Type 1 4 2018 2 27 Complementary 0 0 0 0.00000 1 Not_Canceled
266 1 0 0 2 Meal Plan 1 0 Room_Type 1 1 2017 8 12 Complementary 1 0 1 0.00000 1 Not_Canceled
267 1 0 2 1 Meal Plan 1 0 Room_Type 1 4 2017 8 23 Complementary 0 0 0 0.00000 1 Not_Canceled
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35983 1 0 0 1 Meal Plan 1 0 Room_Type 7 0 2018 6 7 Complementary 1 4 17 0.00000 1 Not_Canceled
36080 1 0 1 1 Meal Plan 1 0 Room_Type 7 0 2018 3 21 Complementary 1 3 15 0.00000 1 Not_Canceled
36114 1 0 0 1 Meal Plan 1 0 Room_Type 1 1 2018 3 2 Online 0 0 0 0.00000 0 Not_Canceled
36217 2 0 2 1 Meal Plan 1 0 Room_Type 2 3 2017 8 9 Online 0 0 0 0.00000 2 Not_Canceled
36250 1 0 0 2 Meal Plan 2 0 Room_Type 1 6 2017 12 10 Online 0 0 0 0.00000 0 Not_Canceled

545 rows × 18 columns

In [32]:
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
Out[32]:
Complementary    354
Online           191
Name: market_segment_type, dtype: int64

The zero value avg price per rooms are all comps, or possibly online comps or card rewards. These are valid data points so we will leave them as is!

Lets look at our final distribution:

In [33]:
histogram_boxplot(data, 'avg_price_per_room', bins = 25)
avg_price_per_room Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
39.70000        120.00000       179.55000       261.69000       1069                

Unique Values Above Upper Bound: ['179.71' '179.92' '180.0' '180.16' '180.2' '180.25' '...etc']

The number of outliers did not change, simply the magnitude of the outlier-ness of them. I think this will affect the analysis correctly; pulling the distribution a little closer to normal and honoring that there are "high dollar rooms". I anticipate that , at least in the giant tree, there will be a price node. I can watch to see what that price is to see if my outlier treatment was too agreesive.

Observations on number of previous booking cancellations¶

In [34]:
histogram_boxplot(data, 'no_of_previous_cancellations')
no_of_previous_cancellations Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
0.00000         0.00000         0.00000         13              338                 

Unique Values Above Upper Bound: ['1' '2' '3' '4' '5' '6' '...etc']

No removal of outliers here - all these values are valid. There is simply a large numbr of 0 previous cacnelations. Lets look at some patterns in the > 0 cancellation data.

In [35]:
data.loc[data['no_of_previous_cancellations'] > 0, "market_segment_type"].value_counts()
Out[35]:
Corporate        176
Offline           68
Online            55
Complementary     36
Aviation           3
Name: market_segment_type, dtype: int64

Most of previous cancellations are from the corporate sector

In [36]:
data.loc[data['booking_status'] == "Canceled", "market_segment_type"].value_counts()
Out[36]:
Online       8475
Offline      3153
Corporate     220
Aviation       37
Name: market_segment_type, dtype: int64

Wheras the data point status cancellations are mostly from online bookings.

Observations on number of previous booking not canceled¶

In [37]:
histogram_boxplot(data, 'no_of_previous_bookings_not_canceled')
no_of_previous_bookings_not_canceled Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
0.00000         0.00000         0.00000         58              812                 

Unique Values Above Upper Bound: ['1' '2' '3' '4' '5' '6' '...etc']

Similarly will leave these outliers. The 58 prior not cancellations is a frequent travelor or company and they are an important feature of this model.

Lets look at the Booking Status split for users that are new customers (such that both previous cancellation and not canceled are both 0)

In [38]:
data.loc[(data['no_of_previous_cancellations'] == 0) & (data['no_of_previous_bookings_not_canceled'] == 0),"booking_status"].value_counts(normalize = True)
Out[38]:
Not_Canceled   0.66420
Canceled       0.33580
Name: booking_status, dtype: float64

This is essentially a miniature decision tree - one that may end up being relevatnt to our final model or nnot. This is slight feature engineering in a new dataframe, will decide later to carry over to primary dataframe or not.

In [39]:
data2 = data.copy()

# Add new columns that show previous status as yes/no instead of counts
data2['previously_cancelled'] = np.where(data2['no_of_previous_cancellations'] > 0, 'yes', 'no')
data2['previous_stayed'] = np.where(data2['no_of_previous_bookings_not_canceled'] > 0, 'yes', 'no')

# Group by previously_cancelled and previous_stayed, then count by booking_status
result = data2.groupby(['previously_cancelled', 'previous_stayed', 'booking_status']).size().reset_index(name='count')
print(result)
  previously_cancelled previous_stayed booking_status  count
0                   no              no       Canceled  11869
1                   no              no   Not_Canceled  23476
2                   no             yes   Not_Canceled    592
3                  yes              no       Canceled      9
4                  yes              no   Not_Canceled    109
5                  yes             yes       Canceled      7
6                  yes             yes   Not_Canceled    213
In [40]:
# Group by previously_cancelled and previous_stayed, then count by booking_status
result = data2.groupby(['booking_status','previously_cancelled', 'previous_stayed']).size().reset_index(name='count')
print(result)
  booking_status previously_cancelled previous_stayed  count
0       Canceled                   no              no  11869
1       Canceled                  yes              no      9
2       Canceled                  yes             yes      7
3   Not_Canceled                   no              no  23476
4   Not_Canceled                   no             yes    592
5   Not_Canceled                  yes              no    109
6   Not_Canceled                  yes             yes    213

This shows us that 1/3 of new customer bookings (in charts as no/no for previous statuses) have cancelled.

Further, all 592 customers who are (no cancellations / yes returning customer) do not cancel. That sounds like a potentially pure node, but that size of data set is small.

Observations on repeated guests¶

Belatedly realizing this column exists, but will keep the previous analysis in the cancelled vs not cancelled section. Yes = 1 and No = 0 in this data. This column counts a repeated guest as anyone who has booked before, even if they cancelled. So the above info might actually be stronger.

In [41]:
labeled_barplot(data, "repeated_guest", perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for repeated_guest
repeated_guest Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
0.00000         0.00000         0.00000         1               930                 

Unique Values Above Upper Bound: [1]
In [42]:
# Group by previously_cancelled and previous_stayed, then count by booking_status
result = data2.groupby(['booking_status','previously_cancelled', 'previous_stayed', 'repeated_guest']).size().reset_index(name='count')
print(result)
  booking_status previously_cancelled previous_stayed  repeated_guest  count
0       Canceled                   no              no               0  11869
1       Canceled                  yes              no               1      9
2       Canceled                  yes             yes               1      7
3   Not_Canceled                   no              no               0  23476
4   Not_Canceled                   no             yes               1    592
5   Not_Canceled                  yes              no               1    109
6   Not_Canceled                  yes             yes               1    213

Observations on number of adults¶

In [43]:
labeled_barplot(data, "no_of_adults", perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_adults
no_of_adults Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
0.00000         2.00000         2.00000         4               2333                

Unique Values Above Upper Bound: [3 4]

The "outliers" of number of adults are 2 customer lines, so will leave them for now.

Observations on number of children¶

In [44]:
labeled_barplot(data, "no_of_children", perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_children
no_of_children Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
0.00000         0.00000         0.00000         10              2698                

Unique Values Above Upper Bound: [ 1  2  3  9 10]
In [45]:
data.loc[(data['no_of_children'] > 2),"booking_status"].value_counts()
Out[45]:
Not_Canceled    16
Canceled         6
Name: booking_status, dtype: int64

We can see that the 2/3 Not_Cancelled rate holds for these 22 values, so replace the 9 and 10 children values with 3s. This creates a "greater than 2" category.

In [46]:
# treate values of 3, 9, and 10 as "greater than 2 children"
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)

Observations on number of week nights¶

In [47]:
labeled_barplot(data, 'no_of_week_nights',  perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_week_nights
no_of_week_nights Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
2.00000         3.00000         6.00000         17              324                 

Unique Values Above Upper Bound: ['7' '8' '9' '10' '11' '12' '...etc']

The 0 weeknight category is the customers who only stay on weekends (Saturday or Sunday).

The 6+ category of data is customers who stay into two calendar weeks, across a weekened. Lets combine all the 6+ data into the 6.

In [48]:
# treat all customers who stayed over two calendar weeks as the same
data["no_of_week_nights"] = data["no_of_week_nights"].apply(lambda x: min(x, 6))

Lets check the assumption that these 6+ weekday stays include a weekend:

In [49]:
data[data["no_of_week_nights"] == 6].groupby("no_of_weekend_nights").size().reset_index(name='count')
Out[49]:
no_of_weekend_nights count
0 2 246
1 3 100
2 4 112
3 5 34
4 6 20
5 7 1

This looks good to me; all of these customers have at least 2 weekend nights in their stays. We can confirm that the 6+ weekday stay customers are all multi calendar week stay customers.

Observations on number of weekend nights¶

In [50]:
labeled_barplot(data, 'no_of_weekend_nights',  perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_weekend_nights
no_of_weekend_nights Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
2.00000         2.00000         5.00000         7               21                  

Unique Values Above Upper Bound: [6 7]
In [51]:
# treat all customers who stayed over three calendar weekends the same
data["no_of_weekend_nights"] = data["no_of_weekend_nights"].apply(lambda x: min(x, 5))

Observations on required car parking space¶

In [52]:
labeled_barplot(data, 'required_car_parking_space',  perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for required_car_parking_space
required_car_parking_space Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
0.00000         0.00000         0.00000         1               1124                

Unique Values Above Upper Bound: [1]

Observations on type of meal plan¶

In [53]:
labeled_barplot(data, 'type_of_meal_plan',  perc=True, rotation = 45)
Skipping outlier analysis for type_of_meal_plan as it contains string values.
In [54]:
data['type_of_meal_plan'].value_counts()
Out[54]:
Meal Plan 1     27835
Not Selected     5130
Meal Plan 2      3305
Meal Plan 3         5
Name: type_of_meal_plan, dtype: int64

Meal plan 3 is closest to Meal plan 2. But Ill leave it as a seperate categorical for now. With the value so small, I can't imagine it will appear in the final tree unless its "Meal Plan 1" vs "Not Meal Plan 1".

Observations on room type reserved¶

In [55]:
labeled_barplot(data, 'room_type_reserved',  perc=True, rotation = 45)
Skipping outlier analysis for room_type_reserved as it contains string values.

Observations on arrival month - Booking Status changed to 1/0 notation here¶

In [56]:
labeled_barplot(data, 'arrival_month',  perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for arrival_month
In [57]:
# grouping the data on arrival months and extracting the count of bookings
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()

# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
    {"Month": list(monthly_data.index), "Guests": list(monthly_data.values)}
)

# plotting the trend over different months
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.show()

These effectively communicate the same thing, but we rarely get to use lineplots in this class so it was nice to include.

In [58]:
#Lets start this feature building a little early so we can see cancellations by month as well
data["booking_status"] = data["booking_status"].apply(
    lambda x: 1 if x == "Canceled" else 0
)
In [59]:
stacked_barplot(data, "arrival_month", "booking_status", rotation = 0, sort_columns = False) ## Complete the code to plot stacked barplot for arrival month and booking status
booking_status      0      1    All
arrival_month                      
All             24390  11885  36275
10               3437   1880   5317
9                3073   1538   4611
8                2325   1488   3813
7                1606   1314   2920
6                1912   1291   3203
4                1741    995   2736
5                1650    948   2598
11               2105    875   2980
3                1658    700   2358
2                1274    430   1704
12               2619    402   3021
1                 990     24   1014
------------------------------------------------------------------------------------------------------------------------

There more cancelations in the Northern Hemisphere "Summer Months" while companies are often lenient on leave and the children are largely nto in school.

Observations on market segment type¶

In [60]:
labeled_barplot(data, "market_segment_type", rotation = 45,  perc=True)
Skipping outlier analysis for market_segment_type as it contains string values.

Observations on number of special requests¶

In [61]:
labeled_barplot(data, 'no_of_special_requests',  perc=True, rotation = 0, sort_index= True)
Performing numeric-specific action for no_of_special_requests
no_of_special_requests Outliers Information:

 IQR             Q3              Upper Bound     Max             #rows > Upper Bound
1.00000         1.00000         2.50000         5               761                 

Unique Values Above Upper Bound: [3 4 5]

Outlier Treatment; combine the 3,4,and 5 sepecial requests bookings. This is because the upper bound is 2.5 and relatively small percentage of data points are beyond that point.

In [62]:
data.loc[data["no_of_special_requests"] > 3, "no_of_special_requests"] = 3

Additional Bivariate Analysis¶

In [63]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(12, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="cividis"
)
plt.show()

There are not any surprises in this correlation matrix. There is additional work below to look at the smaller correlation values, but here are the big ones:

  • Positive Correlations

    • Previous_Bookings_Not_Cancelled, Previous_Cancellations, and Repeated_Guest all have correlations above 0.39. This is to be expected from the univariate analysis; Repeated_Guests is essentially the combination of the other two categories.
    • Lead_Time and Booking_Status have a positive correlation of 0.44. This seems like it will be a major contributer.
    • Average_Price and No_of_Children have a correlation of 0.35
    • Average_Price and No_of_Adults have a correlation of 0.30
  • Negative Correlations

    • The greatest negative correlation is -0.25 between booking_status and no_special_requests.

Hotel rates are dynamic and change according to demand and customer demographics. Let's see how prices vary across different market segments

In [64]:
plt.figure(figsize=(10, 6))
sns.boxplot(
    data=data, x="market_segment_type", y="avg_price_per_room"
)
plt.show()
  • Complementary rooms are the most noisy, random in pricing. Aviation is most likely flight crews, who would have deals with hotels. Their small spread makes sense.
  • The spread for the Offline and Online bookings are similar, with the Online being slightly more expensive. This makes sense as an additional feature must be payed for in order for online bookings to exist.

Let's see how booking status varies across different market segments. Also, how average price per room impacts booking status

In [65]:
# Reminder that "1" here is a Cancelation since that is the behavior we are modeling
stacked_barplot(data, "market_segment_type", "booking_status", rotation = 45)
booking_status           0      1    All
market_segment_type                     
All                  24390  11885  36275
Online               14739   8475  23214
Offline               7375   3153  10528
Corporate             1797    220   2017
Aviation                88     37    125
Complementary          391      0    391
------------------------------------------------------------------------------------------------------------------------
  • As we saw earlier, Online bookings are most likely to cancel.
  • Complementary bookings never cancel in this data set

Many guests have special requirements when booking a hotel room. Let's see how it impacts cancellations

In [66]:
stacked_barplot(data, "no_of_special_requests", "booking_status", rotation = 0)
booking_status              0      1    All
no_of_special_requests                     
All                     24390  11885  36275
0                       11232   8545  19777
1                        8670   2703  11373
2                        3727    637   4364
3                         761      0    761
------------------------------------------------------------------------------------------------------------------------

The correlation matrix predicted this; a negative correlation between special requests and cancellations (where cancellations = 1). As the number of requests increased, so did likely hood of Not Cancelling. This seems to indicate a level of investment.

Let's see if the special requests made by the customers impacts the prices of a room

In [67]:
plt.figure(figsize=(10, 5))
sns.boxplot(data = data, x = 'no_of_special_requests', y = 'avg_price_per_room')  ## Complete the code to create boxplot for no of special requests and average price per room (excluding the outliers)
plt.show()

We saw earlier that there is a positive correlation between booking status and average price per room. Let's analyze it

In [68]:
distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")

We see that the average price per room is slightly greater in the case of cancellations.

There is a positive correlation between booking status and lead time also. Let's analyze it further

In [69]:
distribution_plot_wrt_target(data, 'lead_time', 'booking_status') ## Complete the code to find distribution of lead time wrt booking status

Greater lead times generally correlate with more cancellations.

Generally people travel with their spouse and children for vacations or other activities. Let's create a new dataframe of the customers who traveled with their families and analyze the impact on booking status.

In [70]:
family_data = data[(data["no_of_children"] >= 0) & (data["no_of_adults"] > 1)]
family_data.shape
Out[70]:
(28441, 18)
In [71]:
pd.options.mode.chained_assignment = None  # default='warn'
family_data.loc[:, "no_of_family_members"] = family_data["no_of_adults"] + family_data["no_of_children"]
In [72]:
stacked_barplot(family_data, 'no_of_family_members', 'booking_status', rotation = 0, sort_columns = False) ## Complete the code to plot stacked barplot for no of family members and booking status
booking_status            0     1    All
no_of_family_members                    
All                   18456  9985  28441
2                     15506  8213  23719
3                      2425  1368   3793
4                       514   398    912
5                        11     6     17
------------------------------------------------------------------------------------------------------------------------

There are slightly more cancelations as the family increases from 2 to 4 people, but then a family of 5 and a family of 2 have similar rates.

Let's do a similar analysis for the customer who stay for at least a day at the hotel.

In [73]:
stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)]
stay_data.shape
Out[73]:
(17094, 18)
In [74]:
stay_data["total_days"] = (
    stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"]
)
In [75]:
stacked_barplot(stay_data, "total_days", "booking_status", rotation = 0, sort_columns = False) ## Complete the code to plot stacked barplot for total days and booking status
booking_status      0     1    All
total_days                        
All             10979  6115  17094
3                3689  2183   5872
4                2977  1387   4364
5                1593   738   2331
2                1301   639   1940
6                 566   465   1031
7                 590   383    973
8                 157   142    299
10                 36    76    112
9                  61    56    117
11                  9    46     55
------------------------------------------------------------------------------------------------------------------------

Generally, as the number of days increases there is an increase in cancellations, but truly the pattern isnt consistent until total days is greater than 8. This is including the outlier work we did earlier; such that the 11 total days is truncated.

Repeating guests are the guests who stay in the hotel often and are important to brand equity. Let's see what percentage of repeating guests cancel?

In [76]:
stacked_barplot(data, "repeated_guest", "booking_status", rotation = 0, sort_columns = False) ## Complete the code to plot stacked barplot for repeated guests and booking status
booking_status      0      1    All
repeated_guest                     
All             24390  11885  36275
0               23476  11869  35345
1                 914     16    930
------------------------------------------------------------------------------------------------------------------------

We have seen this a few ways; generally repeated guests are not cancelleing.

As hotel room prices are dynamic, Let's see how the prices vary across different months

In [77]:
plt.figure(figsize=(10, 5))
sns.lineplot(data, x= 'arrival_month', y = 'avg_price_per_room') ## Complete the code to create lineplot between average price per room and arrival month
plt.show()

The "summer months" in the northern hemisphere is from June to August. This is when most students are not in school - it makes sense that there is increased travel and thus higher prices in those months.

In [78]:
pd.crosstab(data['market_segment_type'], data['room_type_reserved']).plot(kind='bar', stacked=True)
plt.show()

Final Sanity Checks¶

These boxplots for outlier detection are a little choppy as compared to the analysis above, but included for thoroughness. There are clearly still some outliers, but I am confident in the previous treatments.

In [79]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

As we said at the start, lets take one last look at duplicates to see if they increased with the outlier treatments:

In [80]:
dups_by_target(data, 'booking_status', 0,1)
   Status  Duplicate Count Percentage  Unique Sets Count  Max Count in Sets
0       0             5833     23.92%               2093                 91
1       1             4443     37.38%               1045                 83

We actually do not see much change in this data! That means the outlier treatments did not create new "duplicate reservation types". That might mean that outlier work was pointless, but lets carry on.

Model Building¶

Goal: To reduce losses¶

  • Hotel would want F1 Score to be maximized, greater the F1 score higher are the chances of minimizing False Negatives and False Positives.

Custom Functions¶

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

  • The model_performance_classification_statsmodels function will be used to check the model performance of models.
  • The confusion_matrix_statsmodels function will be used to plot the confusion matrix.
In [81]:
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [82]:
# defining a function to plot the confusion_matrix of a classification model

def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [83]:
def treating_multicollinearity(predictors, target, high_vif_columns):
    """
    Checking the effect of dropping the columns showing high multicollinearity
    on model performance (adj. R-squared and RMSE)

    predictors: independent variables
    target: dependent variable
    high_vif_columns: columns having high VIF
    """
    # empty lists to store adj. R-squared and RMSE values
    adj_r2 = []
    rmse = []

    # build ols models by dropping one of the high VIF columns at a time
    # store the adjusted R-squared and RMSE in the lists defined previously
    for cols in high_vif_columns:
        # defining the new train set
        train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]

        # create the model
        olsmodel = sm.OLS(target, train).fit()

        # adding adj. R-squared and RMSE to the lists
        adj_r2.append(olsmodel.rsquared_adj)
        rmse.append(np.sqrt(olsmodel.mse_resid))

    # creating a dataframe for the results
    temp = pd.DataFrame(
        {
            "col": high_vif_columns,
            "Adj. R-squared after_dropping col": adj_r2,
            "RMSE after dropping col": rmse,
        }
    ).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
    temp.reset_index(drop=True, inplace=True)

    return temp

1) Logistic Regression¶

Data Preparation for Logistic Regression¶

In [84]:
# Create Independent and Dependent Variables
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]

# Add a constant to X
X = sm.add_constant(X)

# Create dummy variables for categorical columns in X
X = pd.get_dummies(X, drop_first=True)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
In [85]:
# This is a quick check to make sure that our class distribution is equal across the train and test sets

print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)
Percentage of classes in training set:
0   0.67064
1   0.32936
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.67638
1   0.32362
Name: booking_status, dtype: float64

Multicollinearity¶

In [86]:
# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
In [87]:
vif_df = checking_vif(X_train).sort_values(by = "VIF")
vif_df
Out[87]:
feature VIF
19 room_type_reserved_Room_Type 3 1.00330
9 arrival_date 1.00669
16 type_of_meal_plan_Meal Plan 3 1.02524
21 room_type_reserved_Room_Type 5 1.02825
5 required_car_parking_space 1.04036
3 no_of_weekend_nights 1.05479
4 no_of_week_nights 1.09185
18 room_type_reserved_Room_Type 2 1.10609
23 room_type_reserved_Room_Type 7 1.11625
14 no_of_special_requests 1.25137
15 type_of_meal_plan_Meal Plan 2 1.27242
17 type_of_meal_plan_Not Selected 1.27535
8 arrival_month 1.27656
1 no_of_adults 1.35365
20 room_type_reserved_Room_Type 4 1.36521
11 no_of_previous_cancellations 1.39560
6 lead_time 1.39953
7 arrival_year 1.43187
12 no_of_previous_bookings_not_canceled 1.65182
10 repeated_guest 1.78366
22 room_type_reserved_Room_Type 6 2.05651
13 avg_price_per_room 2.07307
2 no_of_children 2.09439
24 market_segment_type_Complementary 4.50122
25 market_segment_type_Corporate 16.91748
26 market_segment_type_Offline 64.08484
27 market_segment_type_Online 71.16080
0 const 39496591.94636

The only high VIFs are in the categoricals. This is because certain Markets only reserve certain room types. This will drp away when we do preformance checks.

Building Logistic Regression Model¶

In [88]:
# Fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float)) #increased the interation due to p-value work
lg = logit.fit(disp = False)

# Printing summary of the model
print(lg.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25364
Method:                           MLE   Df Model:                           27
Date:                Sat, 30 Sep 2023   Pseudo R-squ.:                  0.3291
Time:                        09:14:54   Log-Likelihood:                -10796.
converged:                      False   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 -922.7827    120.975     -7.628      0.000   -1159.889    -685.676
no_of_adults                             0.1119      0.038      2.968      0.003       0.038       0.186
no_of_children                           0.1587      0.062      2.555      0.011       0.037       0.280
no_of_weekend_nights                     0.1140      0.020      5.803      0.000       0.076       0.153
no_of_week_nights                        0.0156      0.014      1.153      0.249      -0.011       0.042
required_car_parking_space              -1.5980      0.138    -11.600      0.000      -1.868      -1.328
lead_time                                0.0158      0.000     59.114      0.000       0.015       0.016
arrival_year                             0.4561      0.060      7.608      0.000       0.339       0.574
arrival_month                           -0.0419      0.006     -6.467      0.000      -0.055      -0.029
arrival_date                             0.0005      0.002      0.280      0.779      -0.003       0.004
repeated_guest                          -2.3610      0.618     -3.817      0.000      -3.573      -1.149
no_of_previous_cancellations             0.2658      0.086      3.094      0.002       0.097       0.434
no_of_previous_bookings_not_canceled    -0.1724      0.153     -1.128      0.259      -0.472       0.127
avg_price_per_room                       0.0189      0.001     25.514      0.000       0.017       0.020
no_of_special_requests                  -1.4709      0.030    -48.892      0.000      -1.530      -1.412
type_of_meal_plan_Meal Plan 2            0.1735      0.067      2.607      0.009       0.043       0.304
type_of_meal_plan_Meal Plan 3           27.2852    1.6e+05      0.000      1.000   -3.13e+05    3.13e+05
type_of_meal_plan_Not Selected           0.2753      0.053      5.183      0.000       0.171       0.379
room_type_reserved_Room_Type 2          -0.3620      0.131     -2.757      0.006      -0.619      -0.105
room_type_reserved_Room_Type 3          -0.0182      1.314     -0.014      0.989      -2.593       2.557
room_type_reserved_Room_Type 4          -0.2783      0.053     -5.226      0.000      -0.383      -0.174
room_type_reserved_Room_Type 5          -0.7202      0.209     -3.439      0.001      -1.131      -0.310
room_type_reserved_Room_Type 6          -0.9472      0.151     -6.262      0.000      -1.244      -0.651
room_type_reserved_Room_Type 7          -1.3507      0.292     -4.627      0.000      -1.923      -0.779
market_segment_type_Complementary      -28.1577    1.6e+05     -0.000      1.000   -3.13e+05    3.13e+05
market_segment_type_Corporate           -1.2256      0.264     -4.634      0.000      -1.744      -0.707
market_segment_type_Offline             -2.2291      0.253     -8.811      0.000      -2.725      -1.733
market_segment_type_Online              -0.4277      0.250     -1.713      0.087      -0.917       0.062
========================================================================================================
In [89]:
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
Out[89]:
Accuracy Recall Precision F1
0 0.80541 0.63219 0.73923 0.68153
In [90]:
confusion_matrix_statsmodels(lg, X_train, y_train)

Model 0 Observations¶

  • The large negative coefficient on Complementary matches our earleir analysis that Complementary rooms do not cancel
  • On the other extreme, Meal_plan_3s have a high cancellation rate
  • the p-values indicate if the varaiables are significant or not - we see several values with high p-values that we will drop.
  • Our stated goal was to have as high a F1 value as possible - we want to limit both False-Negatives (because the customer then shows up and the hotel is not prepared) and False-Positives (because the hotel will lose money on empty rooms or lowered flash prices to get last minute bookings. We have a lot of room for improvement so far, with almost 20% of the data in those two categories.

Removing high p-value variables¶

In [91]:
# initial list of columns
predictors = X_train.copy()
cols = predictors.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    X_train_aux = predictors[cols]

    # fitting the model
    model = sm.OLS(y_train, X_train_aux).fit()

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Complementary', 'market_segment_type_Corporate', 'market_segment_type_Offline']

^ That is the list of features that are staying after that analysis.

In [92]:
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
In [93]:
logit1 = sm.Logit(y_train, X_train1)
lg1 = logit1.fit(disp=False) #cant get google collab to increase iterations, so it is stopping at 35
print(lg1.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25372
Method:                           MLE   Df Model:                           19
Date:                Sat, 30 Sep 2023   Pseudo R-squ.:                  0.3279
Time:                        09:14:56   Log-Likelihood:                -10814.
converged:                      False   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 -861.2730    116.634     -7.384      0.000   -1089.871    -632.675
no_of_adults                             0.1082      0.037      2.898      0.004       0.035       0.181
no_of_children                           0.1569      0.062      2.529      0.011       0.035       0.279
no_of_weekend_nights                     0.1174      0.020      6.014      0.000       0.079       0.156
required_car_parking_space              -1.6078      0.138    -11.678      0.000      -1.878      -1.338
lead_time                                0.0160      0.000     61.092      0.000       0.015       0.016
arrival_year                             0.4254      0.058      7.360      0.000       0.312       0.539
arrival_month                           -0.0444      0.006     -6.931      0.000      -0.057      -0.032
no_of_previous_bookings_not_canceled    -0.6621      0.213     -3.115      0.002      -1.079      -0.246
avg_price_per_room                       0.0194      0.001     27.254      0.000       0.018       0.021
no_of_special_requests                  -1.4694      0.030    -49.011      0.000      -1.528      -1.411
type_of_meal_plan_Not Selected           0.2700      0.053      5.109      0.000       0.166       0.374
room_type_reserved_Room_Type 2          -0.3674      0.131     -2.796      0.005      -0.625      -0.110
room_type_reserved_Room_Type 4          -0.2803      0.053     -5.314      0.000      -0.384      -0.177
room_type_reserved_Room_Type 5          -0.7280      0.209     -3.489      0.000      -1.137      -0.319
room_type_reserved_Room_Type 6          -0.9758      0.151     -6.479      0.000      -1.271      -0.681
room_type_reserved_Room_Type 7          -1.3927      0.292     -4.777      0.000      -1.964      -0.821
market_segment_type_Complementary      -26.9785   1.19e+05     -0.000      1.000   -2.33e+05    2.33e+05
market_segment_type_Corporate           -0.8522      0.103     -8.299      0.000      -1.053      -0.651
market_segment_type_Offline             -1.7764      0.051    -35.122      0.000      -1.876      -1.677
========================================================================================================

Model 1 - lg1 now has no multicollinearity and only significant values. Lets analyze this again:

Model 1 Observations¶

Converting coefficients to odds¶

In [94]:
# converting coefficients to odds
odds = np.exp(lg1.params)

# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100

# removing limit from number of columns to display
pd.set_option("display.max_columns", None)

# adding the odds to a dataframe
odds_df = pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
print(odds_df)
                 const  no_of_adults  no_of_children  no_of_weekend_nights  required_car_parking_space  lead_time  arrival_year  arrival_month  no_of_previous_bookings_not_canceled  avg_price_per_room  no_of_special_requests  type_of_meal_plan_Not Selected  room_type_reserved_Room_Type 2  room_type_reserved_Room_Type 4  room_type_reserved_Room_Type 5  room_type_reserved_Room_Type 6  room_type_reserved_Room_Type 7  market_segment_type_Complementary  market_segment_type_Corporate  market_segment_type_Offline
Odds           0.00000       1.11423         1.16990               1.12460                     0.20033    1.01610       1.53022        0.95653                               0.51576             1.01961                 0.23007                         1.30999                         0.69256                         0.75555                         0.48290                         0.37690                         0.24839                            0.00000                        0.42648                      0.16925
Change_odd% -100.00000      11.42258        16.99034              12.46016                   -79.96676    1.61039      53.02158       -4.34691                             -48.42417             1.96139               -76.99322                        30.99875                       -30.74447                       -24.44507                       -51.71047                       -62.30984                       -75.16054                         -100.00000                      -57.35221                    -83.07512
In [95]:
top6_columns = odds_df.abs().sort_values(by="Change_odd%", axis=1, ascending=False).iloc[:, :6]

print(top6_columns)
                const  market_segment_type_Complementary  market_segment_type_Offline  required_car_parking_space  no_of_special_requests  room_type_reserved_Room_Type 7
Odds          0.00000                            0.00000                      0.16925                     0.20033                 0.23007                         0.24839
Change_odd% 100.00000                          100.00000                     83.07512                    79.96676                76.99322                        75.16054

Six was chosen simply because the constant will be one of the colums - I am actually interested in the top 5 features that change the odds so I can better compare to the Decision Tree Visualization.

Checking model performance on the training set¶

In [96]:
# creating confusion matrix
conf_matrix_default = confusion_matrix_statsmodels(lg1, X_train1, y_train)
In [97]:
# checking model performance on train set (seen 70% data)
print("Training Performance - Model 0")
lg_perf_train = model_performance_classification_statsmodels(lg, X_train, y_train)
print(lg_perf_train)

# checking model performance on test set (seen 30% data)
print("Test Performance- Model 0")
lg_perf_test = model_performance_classification_statsmodels(lg, X_test, y_test)
print(lg_perf_test,"\n\n")

# checking model performance on train set (seen 70% data)
print("Training Performance - Model 1")
lg1_perf_train1 = model_performance_classification_statsmodels(lg1, X_train1, y_train)
print(lg1_perf_train1)

# checking model performance on test set (seen 30% data)
print("Test Performance- Model 1")
lg_perf_test1= model_performance_classification_statsmodels(lg1, X_test1, y_test)
print(lg_perf_test1)
Training Performance - Model 0
   Accuracy  Recall  Precision      F1
0   0.80541 0.63219    0.73923 0.68153
Test Performance- Model 0
   Accuracy  Recall  Precision      F1
0   0.80428 0.63061    0.72820 0.67590 


Training Performance - Model 1
   Accuracy  Recall  Precision      F1
0   0.80415 0.62920    0.73759 0.67910
Test Performance- Model 1
   Accuracy  Recall  Precision      F1
0   0.80401 0.62890    0.72838 0.67500

The improvement is in the consistency in the performance across train and test. In otherwords, we are no longer overfitting to the train data. I hope we can continue to improve that F1 score though, as 67% is also the rate of cancellation in the original data set.

Select Threshold to maximize F1¶

ROC-AUC¶

  • ROC-AUC on training set
In [98]:
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

AUC-ROC curve¶

In [99]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3832236744031388
In [100]:
# creating confusion matrix
conf_matrix_roc = confusion_matrix_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
In [101]:
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance AUC:")
log_reg_model_train_perf_threshold_auc_roc
Training performance AUC:
Out[101]:
Accuracy Recall Precision F1
0 0.79620 0.72701 0.67766 0.70147

This shifted more of the error into the False positive - which in this case means we are over-predicting cancellations. The Accuracy and Precision decreased, byt the Recall and F1 increased.

Precision-Recall Curve¶

In [102]:
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
In [103]:
# I had to google this, I wanted to find the exact intersection.
def find_threshold_for_intersect(precisions, recalls, thresholds):
    for i in range(len(precisions) - 1):
        if precisions[i] == recalls[i]:
            return thresholds[i]

# Find the threshold for intersection
intersect_threshold = find_threshold_for_intersect(prec, rec, tre)

print("Intersection Threshold:", intersect_threshold)
Intersection Threshold: 0.4222295680330016

Truly I need to minimize both, but operationally not having enough rooms for customers is a larger issue then rooms being empty (not that airlines agree with that assessment). Perfect Intersection would be 0.42, so lets compare that to the AUC-ROC threshold of 0.38

Compare the performance of the three thresholds:¶

In [104]:
print("PR Threshold:  0.42 ")
conf_matrix_pr = confusion_matrix_statsmodels(lg1, X_train1, y_train, threshold=intersect_threshold)

log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=intersect_threshold
)
log_reg_model_train_perf_threshold_curve
PR Threshold:  0.42 
Out[104]:
Accuracy Recall Precision F1
0 0.80041 0.69688 0.69705 0.69696
In [105]:
print("ROC-AUC Threshold: 0.38")
conf_matrix_roc = confusion_matrix_statsmodels(lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc)

log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
log_reg_model_train_perf_threshold_curve
ROC-AUC Threshold: 0.38
Out[105]:
Accuracy Recall Precision F1
0 0.79620 0.72701 0.67766 0.70147
In [106]:
print("Default Threshold: 0.50")
conf_matrix_default = confusion_matrix_statsmodels(lg1, X_train1, y_train)

log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
     lg1, X_train1, y_train
)
log_reg_model_train_perf_threshold_curve
Default Threshold: 0.50
Out[106]:
Accuracy Recall Precision F1
0 0.80415 0.62920 0.73759 0.67910

The Threshold that maximizes the F1 value, which we originally said was the goal would be the ROC-AUC curve threshold value of 0.38.

However, this is a bit of a tough call, and I would certainly want to talk to a SME about the balance of False Negative and False Positives. This threshold of 0.38 maximizes the F1, but ist also minimizes precision. The False Positive rate is 11.39% with this threshold. In striking the balance between F1 and not overbooking your hotel, I could see the default threshold being selected instead. For the purposes of this assignment, I will select the threshold of 0.38 but also advice taking the risk of overbooking into account.

In [107]:
# setting the threshold
optimal_threshold_curve = optimal_threshold_auc_roc

Model Performance on Test Set¶

In [108]:
log_reg_model_test_perf = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold = optimal_threshold_curve
)

print("Test performance:")
log_reg_model_test_perf
Test performance:
Out[108]:
Accuracy Recall Precision F1
0 0.79721 0.72743 0.67262 0.69895
In [109]:
print("Test performance:")
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc)
Test performance:

2) Decision Tree¶

Tree 1 : Initial Decision¶

Building Decision Tree Model¶
In [110]:
# rebuild the data, since this is a different model

# Create Independent and Dependent Variables
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]

# Add a constant to X
X = sm.add_constant(X)

# Create dummy variables for categorical columns in X
X = pd.get_dummies(X, drop_first=True)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
In [111]:
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
Out[111]:
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=1)
Checking model performance on training set¶
In [112]:
tree1_confmat_train = confusion_matrix_statsmodels(model, X_train, y_train)
In [113]:
tree1_perf_train = model_performance_classification_statsmodels(
    model, X_train, y_train
)
tree1_perf_train
Out[113]:
Accuracy Recall Precision F1
0 0.99421 0.98661 0.99578 0.99117
Checking model performance on test set¶
In [114]:
tree1_confmat_test = confusion_matrix_statsmodels(model, X_test, y_test)
In [115]:
tree1_perf_test = decision_tree_perf_train = model_performance_classification_statsmodels(
    model, X_train, y_train
)
tree1_perf_test
Out[115]:
Accuracy Recall Precision F1
0 0.99421 0.98661 0.99578 0.99117

This is VERY high F1. Not goign to visualize yet - though Ill admit I tried. After a runtime of 3 minutes I interupted it and moved on.

We can look at the important features, however:

In [116]:
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Many of these are not important, so time to reduce overfitting and prune the tree.

Tree 2 : Hyperparameter Pre-Pruning¶

Pruning the tree¶

Pre-Pruning

In [117]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 1),
    "max_leaf_nodes": [2, 3, 5, 10, 50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[117]:
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
Checking performance on training set¶
In [118]:
tree2_confmat_train = confusion_matrix_statsmodels(estimator, X_train, y_train)
In [119]:
tree2_perf_train = decision_tree_perf_train = model_performance_classification_statsmodels(
    estimator, X_train, y_train
)
tree2_perf_train
Out[119]:
Accuracy Recall Precision F1
0 0.83101 0.78620 0.72428 0.75397

The new F1 of 0.75 is greater than the regression model, and drastically less than the unpruned tree.

Checking performance on test set¶
In [120]:
tree2_confmat_test = confusion_matrix_statsmodels(estimator, X_test, y_test)
In [121]:
tree2_perf_test = decision_tree_perf_train = model_performance_classification_statsmodels(
    estimator, X_test, y_test
)
tree2_perf_test
Out[121]:
Accuracy Recall Precision F1
0 0.83497 0.78336 0.72758 0.75444

Good Performance on Test as well, and not a large change from the train.

Visualizing the Decision Tree¶
In [122]:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

This is still a little onweildy.

In [123]:
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 132.08] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 25.81] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1

In [124]:
# importance of features in the tree building

importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Tree 3: Cost Complexity Post- Pruning¶

Classifiers, Alphas¶
In [125]:
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [126]:
pd.DataFrame(path)
Out[126]:
ccp_alphas impurities
0 0.00000 0.00838
1 0.00000 0.00838
2 0.00000 0.00838
3 0.00000 0.00838
4 0.00000 0.00838
... ... ...
1843 0.00890 0.32806
1844 0.00980 0.33786
1845 0.01272 0.35058
1846 0.03412 0.41882
1847 0.08118 0.50000

1848 rows × 2 columns

In [127]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [129]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train,y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08117914389136943
In [130]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
F1 Score vs alpha for training and testing sets¶
In [131]:
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)

f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
In [132]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
In [133]:
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012291224171537176,
                       class_weight='balanced', random_state=1)
Checking performance on training set¶
In [134]:
tree3_confmat_train= confusion_matrix_statsmodels(best_model, X_train, y_train)
In [135]:
tree3_perf_train = decision_tree_perf_train = model_performance_classification_statsmodels(
    best_model, X_train, y_train
)
tree3_perf_train
Out[135]:
Accuracy Recall Precision F1
0 0.89946 0.90231 0.81297 0.85531
Checking performance on test set¶
In [136]:
tree3_confmat_test= confusion_matrix_statsmodels(best_model, X_test, y_test)
In [137]:
tree3_perf_test = decision_tree_perf_train = model_performance_classification_statsmodels(
    best_model, X_test, y_test
)
tree3_perf_test
Out[137]:
Accuracy Recall Precision F1
0 0.86925 0.85548 0.76725 0.80897

Back to slightly overfitting the training data; the F1 score drops from a 0.85 to a 0.81.

In [138]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

I dont have a a good grasp on how complex is too complex. The text report is certainly easier for me to parse.

In [139]:
# Text report showing the rules of a decision tree -

print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |--- lead_time <= 16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 68.50
|   |   |   |   |   |   |   |   |   |--- weights: [207.26, 10.63] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  68.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 29.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |--- arrival_date >  29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 7.59] class: 1
|   |   |   |   |   |   |   |--- lead_time >  16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 135.00
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- repeated_guest <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- repeated_guest >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  135.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |--- weights: [1199.59, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 25.81] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 63.29
|   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [41.75, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  20.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 59.75
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [14.91, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  59.75
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 59.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  63.29
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 3.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time >  59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [20.13, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |--- weights: [413.04, 27.33] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- avg_price_per_room <= 99.98
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 62.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  62.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 81.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  81.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [55.17, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- lead_time >  73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  99.98
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 132.43
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 122.97] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  132.43
|   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- avg_price_per_room <= 75.07
|   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 58.75
|   |   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  58.75
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [29.08, 15.18] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  75.07
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [59.64, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 16.70] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  4.50
|   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [44.73, 4.55] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  22.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |--- weights: [16.40, 39.47] class: 1
|   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |--- weights: [20.13, 6.07] class: 0
|   |   |   |   |   |   |--- arrival_date >  11.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 102.09
|   |   |   |   |   |   |   |   |--- weights: [5.22, 144.22] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room >  102.09
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 109.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [33.55, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  109.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.98, 75.91] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 3.04] class: 0
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |--- weights: [38.02, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_date >  7.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 65.38
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  65.38
|   |   |   |   |   |   |   |   |   |--- weights: [24.60, 3.04] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |   |   |--- arrival_date <= 28.00
|   |   |   |   |   |   |   |   |   |--- weights: [14.91, 72.87] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  28.00
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 1.52] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |--- weights: [84.25, 0.00] class: 0
|   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |--- lead_time <= 125.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.85
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [13.42, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 15.18] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  90.85
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  125.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- weights: [58.15, 18.22] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  19.50
|   |   |   |   |   |   |   |   |   |--- weights: [61.88, 1.52] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 70.05
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  70.05
|   |   |   |   |   |   |   |   |   |--- lead_time <= 5.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- lead_time >  5.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [34.30, 40.99] class: 1
|   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 19.74] class: 1
|   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 74.21
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  74.21
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- lead_time >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 10.63] class: 1
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [155.07, 6.07] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 10.63] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.46, 0.00] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- avg_price_per_room <= 202.67
|   |   |   |   |   |   |   |--- no_of_week_nights <= 4.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |--- weights: [63.37, 30.36] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [115.56, 12.14] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  20.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [28.33, 3.04] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  4.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 6.07] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  202.67
|   |   |   |   |   |   |   |--- weights: [0.75, 22.77] class: 1
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 119.25
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 118.50
|   |   |   |   |   |   |   |   |   |--- weights: [18.64, 59.21] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  118.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.20, 1.52] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  119.25
|   |   |   |   |   |   |   |   |--- weights: [34.30, 171.55] class: 1
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [26.09, 1.52] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 14.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 36.43] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  14.00
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 208.67
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  208.67
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- avg_price_per_room <= 59.43
|   |   |   |   |   |   |   |--- lead_time <= 84.50
|   |   |   |   |   |   |   |   |--- weights: [50.70, 7.59] class: 0
|   |   |   |   |   |   |   |--- lead_time >  84.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 131.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  131.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  59.43
|   |   |   |   |   |   |   |--- lead_time <= 25.50
|   |   |   |   |   |   |   |   |--- weights: [20.88, 6.07] class: 0
|   |   |   |   |   |   |   |--- lead_time >  25.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.34
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [15.66, 78.94] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 102.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  102.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.34
|   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |--- lead_time <= 65.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 120.45
|   |   |   |   |   |   |   |   |   |--- weights: [79.77, 9.11] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  120.45
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 12.14] class: 1
|   |   |   |   |   |   |   |--- lead_time >  65.50
|   |   |   |   |   |   |   |   |--- type_of_meal_plan_Meal Plan 2 <= 0.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [16.40, 47.06] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- type_of_meal_plan_Meal Plan 2 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 63.76] class: 1
|   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 104.31
|   |   |   |   |   |   |   |   |--- lead_time <= 25.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [16.40, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- lead_time >  25.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [39.51, 185.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [73.81, 411.41] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room >  104.31
|   |   |   |   |   |   |   |   |--- arrival_month <= 10.50
|   |   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 5 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 195.30
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 9
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  195.30
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 138.15] class: 1
|   |   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 5 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 9.11] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  10.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 168.06
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 22.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  22.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [17.15, 83.50] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  168.06
|   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 6.07] class: 0
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- lead_time <= 63.00
|   |   |   |   |   |   |   |--- weights: [15.66, 1.52] class: 0
|   |   |   |   |   |   |--- lead_time >  63.00
|   |   |   |   |   |   |   |--- weights: [0.00, 7.59] class: 1
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- lead_time <= 105.00
|   |   |   |   |   |   |   |--- weights: [0.75, 6.07] class: 1
|   |   |   |   |   |   |--- lead_time >  105.00
|   |   |   |   |   |   |   |--- weights: [31.31, 13.66] class: 0
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- no_of_weekend_nights <= 3.50
|   |   |   |   |   |   |   |--- weights: [497.28, 40.99] class: 0
|   |   |   |   |   |   |--- no_of_weekend_nights >  3.50
|   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- arrival_date <= 13.50
|   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |--- weights: [58.90, 36.43] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |--- weights: [33.55, 1.52] class: 0
|   |   |   |   |   |   |--- arrival_date >  13.50
|   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [123.76, 9.11] class: 0
|   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 126.33
|   |   |   |   |   |   |   |   |   |--- weights: [32.80, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  126.33
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 13.66] class: 1
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- avg_price_per_room <= 118.55
|   |   |   |   |   |   |   |--- lead_time <= 61.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [70.08, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 11
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [126.74, 1.52] class: 0
|   |   |   |   |   |   |   |--- lead_time >  61.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 57.69] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 66.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  66.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.93
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [54.43, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.93
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 10
|   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |--- avg_price_per_room >  118.55
|   |   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 177.15
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 118.98
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  118.98
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 7
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  177.15
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 7.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  7.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 24.29] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  19.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 121.20
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [18.64, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  121.20
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 55.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  55.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [11.93, 10.63] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [37.28, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 119.20
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 28.84] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  119.20
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 12
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 100.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [49.95, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  100.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 18.22] class: 1
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- lead_time <= 6.50
|   |   |   |   |   |   |   |--- weights: [32.06, 1.52] class: 0
|   |   |   |   |   |   |--- lead_time >  6.50
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [103.63, 50.10] class: 0
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [44.73, 6.07] class: 0
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- avg_price_per_room <= 202.95
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 9.11] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.20, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- lead_time <= 150.50
|   |   |   |   |   |   |   |   |   |--- weights: [175.20, 28.84] class: 0
|   |   |   |   |   |   |   |   |--- lead_time >  150.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  202.95
|   |   |   |   |   |   |   |--- weights: [0.00, 10.63] class: 1
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- avg_price_per_room <= 153.15
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 2 <= 0.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.12
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.12
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.42
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 7.59] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  90.42
|   |   |   |   |   |   |   |   |   |   |--- weights: [64.12, 60.72] class: 0
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 2 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  153.15
|   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- arrival_month <= 5.00
|   |   |   |   |   |   |   |--- weights: [2.98, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  5.00
|   |   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- lead_time <= 341.00
|   |   |   |   |   |   |   |--- lead_time <= 173.00
|   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [46.97, 9.11] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  3.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  173.00
|   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |--- weights: [188.62, 7.59] class: 0
|   |   |   |   |   |   |--- lead_time >  341.00
|   |   |   |   |   |   |   |--- weights: [13.42, 27.33] class: 1
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- lead_time <= 285.50
|   |   |   |   |   |   |   |--- weights: [8.20, 0.00] class: 0
|   |   |   |   |   |   |--- lead_time >  285.50
|   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |--- lead_time <= 244.00
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 166.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  166.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 57.69] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [17.89, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [75.30, 12.14] class: 0
|   |   |   |   |   |   |   |--- lead_time >  244.00
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [25.35, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 264.15] class: 1
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [7.46, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |--- weights: [46.22, 0.00] class: 0
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- lead_time <= 324.50
|   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.46, 986.78] class: 1
|   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 >  0.50
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 10.63] class: 1
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 19.74] class: 1
|   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |   |   |   |   |--- lead_time >  324.50
|   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |--- weights: [0.75, 13.66] class: 1
|   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- weights: [1.49, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- arrival_date <= 1.50
|   |   |   |   |   |   |   |--- weights: [1.49, 3.04] class: 1
|   |   |   |   |   |   |--- arrival_date >  1.50
|   |   |   |   |   |   |   |--- weights: [35.79, 1.52] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |   |   |--- weights: [7.46, 206.46] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- avg_price_per_room <= 76.48
|   |   |   |   |   |   |   |--- weights: [46.97, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  76.48
|   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 5.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 233.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 152.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  152.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time >  233.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 19.74] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  5.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.95, 16.70] class: 1
|   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.24, 15.18] class: 1
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 269.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 176.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 7.59] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  176.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- lead_time >  269.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- arrival_date <= 14.50
|   |   |   |   |   |   |   |--- weights: [8.20, 3.04] class: 0
|   |   |   |   |   |   |--- arrival_date >  14.50
|   |   |   |   |   |   |   |--- weights: [11.18, 31.88] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1

In [144]:
importances = best_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Comparing Decision Tree models¶

In [141]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        tree1_perf_train.T,
        tree2_perf_train.T,
        tree3_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (Initial)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[141]:
Decision Tree (Initial) Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.99421 0.83101 0.89946
Recall 0.98661 0.78620 0.90231
Precision 0.99578 0.72428 0.81297
F1 0.99117 0.75397 0.85531
In [142]:
# testing performance comparison

models_train_comp_df = pd.concat(
    [
        tree1_perf_test.T,
        tree2_perf_test.T,
        tree3_perf_test.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (Initial)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[142]:
Decision Tree (Initial) Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.99421 0.83497 0.86925
Recall 0.98661 0.78336 0.85548
Precision 0.99578 0.72758 0.76725
F1 0.99117 0.75444 0.80897

The post-pruning tree performs the best on all four metrics as compared to the pre-pruning tree.

All of these models are pretty tough to interpret compared to models explored in class. However, I am getting very good at reading the text files. Here is a depth = 3 version of our strongest tree, the post-pruning tree:

In [143]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
    max_depth=3   #friendly viewing, depth = 3
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

This top-level view allows us to make a few observations and buisness decisions:

Conclusions¶

Business Suggestions

  • Having a large lead time may be a convenience for travelors to ensure their logistics well in advance of their stay - this is particularly true for families juggling work and school schedules. However, the cancelation rate is very high across both models when lead times are more than about 5 months out.
    • Would it be possible to change the reservation system for lead times of greater than 5 months out that is less specific; more like a commitment to travel in a window and a offline conversation occcurs 5-6 months out with the customer?
    • If hotels are able to contact customers when they book through online services, having a dedicated team of customer service representatives that contact customers in that 5-6 month window - this addresses two different features in the models. It is worth mentioning this may be an expensive suggestion - a SME would best know how feasible an increase in customer service reach out is for the volume of bookings.
    • Included details could be month, number of adults, length of stay, and special requests. Excluded details could be the specific dates, including the count of weekend and weekdays, specific room type, food package, parking needs, children attending.
    • The hotel could then use these models to hold a certain percentage of their rooms for those customers until the 5-6 months time occurs.
  • What is the balance of special requests?
    • If a specific hotel has the ability to accomodate special requests, it seems as this increases buy in an continued booking. If the facility has the ability to have a list of 10 common requests and their options for customers to chose when booking, this may decrease cancellations.
    • Follow up would be encouraged; emails, texts, or mailers that highlight the specific requests. As an example; I personally require accessible rooms when I travel. When I get updates from a hotel about the accessibility of my booking I get more excited, and feel valued.
  • Rewards Programs: what systems does the hotel already have for complementary stays?
    • Could more "ammenities" be marketed as "complementary" for specific customers? This may decrease cancelations if customers feel as though they are loosing "complementary upgrades" when the cancel or reschedule.
  • Those were positive encouragments; now for a punishment suggestion. Could the hotels consider a firmed cancellation policy in summer months?
    • This was not as large of a factor, but it was noticible. Increasing fines for summer cancellations may help the buisness recoup losses and lower the risk of the models.


Data Improvements

  • There seems to be a missing category of data - and that is how many stays are rebookings. I appreciate that the target variable needed to stay a binary, but there could have been another category of previous bookings: not cancelled, cancelled, and rebooked.

  • If any of the hotels host major events, or see spikes in volume of bookings due to surronding city events, this could be an additional categorical variable. Was the booking assosiated/correlated with a special event?

Logistic Regression Model Commentary

  • The threshold value of 0.38 optimized the F1 value
  • The F1 value was chosen as the metric to prioritize so that there would be a balance between False-Negatives and False-Positives.
  • However, if a SME wanted to strike a better balance with Precision to avoid the dreaded overbooking that is known to plague the airline industry, the Threshold of 0.42 could also be selected.
  • The F1 value of the final Logistic Regression Model was 0.70 on the test data.
  • The Logistic Regression identified these as the most important factors in cancellations:
    • Complementary Bookings do not cancel
    • Booking Offline
    • Car Parking Space (small % of bookings)
    • Special Requests (small % of bookings)
    • Room Type 7 (small % of bookings)

Decision Tree Model Commentary

  • The Post-pruning decision tree optimized all performance metrics, while blancing complexity.
  • The F1 value was again chosen as the metric to prioritize so that there would be a balance between False-Negatives and False-Positives.
  • The F1 value of the final Decision Tree Model was 0.81 on the test data.
  • The Decision Tree identified these as the most important factors in cancellations:
    • Lead TIme
    • Booking Online
    • Average Price Per Room
    • Special Requests
    • Arrival Month